AML2019
Anomaly detection (AD) refers to the process of detecting data points that do not conform to the rest of the observations. Applications of anomaly detection include fraud and fault detection, surveillance, diagnosis, data cleanup, and predictive maintenance.
When we talk about AD, we usually look at it as an unsupervised (or semi-supervised) task, where the concept of anomaly is often not well defined or, in the best case, only a few samples are labeled as anomalous. In this challenge, we will look at AD from a different perspective!
The dataset we are going to work on consists of monitoring data generated by IT systems; such data is then processed by a monitoring system that executes some checks and detects a series of anomalies. This is a multi-label classification problem, where each check is a binary label corresponding to a specific type of anomaly. Our goal is to develop a machine learning model (or multiple ones) to accurately detect such anomalies.
This will also involve a mixture of data exploration, pre-processing, model selection, and performance evaluation. We will also try one rule learning model, and compare it with other ML models both in terms of predictive performance and interpretability. Interpretability is indeed a strong requirement, especially in applications like AD where understanding the output of a model is as important as the output itself.
The data for this challenge is located at: /mnt/datasets/anomaly
You have a single csv file with 36 features and 8 labels. Each record contains aggregate features computed over a given amount of time.
A brief outline of the available attributes is given below.
Labels are binary. Each label refers to a different anomaly.
The very first step in building a model is to understand the data. In this section we will load, visualize, and explore the meaning of the given data.
Firstly we need to import some necessary packages:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
# Display all the columns
pd.options.display.max_columns = None
A quick check of the given data shows that it is a csv file without headers. To make the analysis easier, we will define the column names as below:
# Base path to the data file
base = "/mnt/datasets/anomaly"
# Column names, in the same order as in the data source
features = np.array([
"SessionNumber",
"SystemID",
"Date",
"HighPriorityAlerts",
"Dumps",
"CleanupOOMDumps",
"CompositeOOMDums",
"IndexServerRestarts",
"NameServerRestarts",
"XSEngineRestarts",
"PreprocessorRestarts",
"DaemonRestarts",
"StatisticsServerRestarts",
"CPU",
"PhysMEM",
"InstanceMEM",
"TablesAllocation",
"IndexServerAllocationLimit",
"ColumnUnloads",
"DeltaSize",
"MergeErrors",
"BlockingPhaseSec",
"Disk",
"LargestTableSize",
"LargestPartitionSize",
"DiagnosisFiles",
"DiagnosisFilesSize",
"DaysWithSuccessfulDataBackups",
"DaysWithSuccessfulLogBackups",
"DaysWithFailedDataBackups",
"DaysWithFailedfulLogBackups",
"MinDailyNumberOfSuccessfulDataBackups",
"MinDailyNumberOfSuccessfulLogBackups",
"MaxDailyNumberOfFailedDataBackups",
"MaxDailyNumberOfFailedLogBackups",
"LogSegmentChange",
])
# List of anomaly types
labels = np.array([
"Check1",
"Check2",
"Check3",
"Check4",
"Check5",
"Check6",
"Check7",
"Check8"])
# load data using predefined headers and character ; as the delimiter
data = pd.read_csv(base + '/data.csv', sep = ';', header=None, names = np.append(features, labels))
Check the shape of our data, the column information, and show the first 10 records:
# Display the first 10 records
print("\nDisplay the first 10 records")
display(data.head(n=10))
# Display the number of entries, columns, its corresponding name and dtype
print ("\nDisplay the data information")
data.info()
Our data contains 287,031 logs with 8 types of anomaly and 36 features. All values are numeric except for the Date column.
The data description shows that some features have constraints on their values; for example, the number of memory dumps should be a positive integer. We will check whether there are any out-of-range values in our dataset.
First we convert the Date column into a datetime type:
#Handle the date
data['Date'] = pd.to_datetime(data['Date'], format = "%d/%m/%Y %H:%M")
print("Date type: ", data.Date.dtype)
Show the values range for each feature/label:
# Show value range for each feature
for f in np.append(features, labels):
    # Don't take NaN values into account
    f_values = data[f].dropna()
    print("%s: [%s , %s] with %s unique values " % (f, min(f_values), max(f_values), f_values.unique().size))
We compared this information with the data description and made the following observations:
First we check the features with values out of range: CPU, Physical Memory, and Disk. All these features should have values in the range [0, 100]; however, some records have values far larger than 100. We know that these figures are sensitive to system behavior; for example, very high memory usage could indicate an anomaly/error. Let's check the distribution of these features and see whether the out-of-range records correspond to anomalies.
tmp_features = np.array(["CPU", "PhysMEM", "Disk"])
for f in tmp_features:
    # Data without NaN values in feature f
    df = data.dropna(subset=[f])
    # Data with f value <= 100
    in_range_records = df.loc[df[f] <= 100]
    # Data with f value > 100
    out_of_range_records = df.loc[df[f] > 100]
    print("%s records with %s smaller than 100 " % (in_range_records.shape[0], f))
    print("%s records with %s larger than 100 " % (out_of_range_records.shape[0], f))
    # Get the labels of the records with f > 100
    out_of_range_records = out_of_range_records.iloc[:, 36:44].fillna(0)
    print("%s anomaly detections with %s larger than 100 " % (out_of_range_records.max(axis=1).sum(), f))
    print("============================")
print("Box plot for these features:")
f = pd.melt(data, value_vars=tmp_features)
g = sns.FacetGrid(f, col="variable",sharex=False, sharey=False)
g = g.map(sns.boxplot, "value")
As expected, 100% of the out-of-range CPU and memory records are anomalies, while for Disk this proportion is around 50%.
Now let's check the last field that has out-of-range values: LogSegmentChange
tmp_features = np.array(["LogSegmentChange"])
for f in tmp_features:
    # Data without NaN values in feature f
    df = data.dropna(subset=[f])
    # Data with f value >= 0
    in_range_records = df.loc[df[f] >= 0]
    # Data with f value < 0
    out_of_range_records = df.loc[df[f] < 0]
    print("%s records with %s larger than 0 " % (in_range_records.shape[0], f))
    print("%s records with %s smaller than 0 " % (out_of_range_records.shape[0], f))
    # Get the labels of the records with f >= 0
    in_range_records = in_range_records.iloc[:, 36:44].fillna(0)
    print("%s anomaly detections with %s larger than 0 " % (in_range_records.max(axis=1).sum(), f))
    # Get the labels of the records with f < 0
    out_of_range_records = out_of_range_records.iloc[:, 36:44].fillna(0)
    print("%s anomaly detections with %s smaller than 0 " % (out_of_range_records.max(axis=1).sum(), f))
    print("============================")
print("Distribution plot for these features:")
f = pd.melt(data, value_vars=tmp_features)
g = sns.FacetGrid(f, col="variable",sharex=False, sharey=False)
g = g.map(sns.distplot, "value")
We see that more than 50% of the records with a negative LogSegmentChange are anomalies, while the corresponding ratio for positive LogSegmentChange is about 30%.
In this section we will look at the types of anomalies. First, let's check the number of detections per anomaly type.
plt.figure(figsize=(15,4))
ax= sns.barplot(labels, data.iloc[:,36:].sum().values)
plt.title("Detections in each anomaly category")
plt.ylabel('Number of detections')
plt.xlabel('Anomaly Types')
#adding the text labels
rects = ax.patches
text_labels = data.iloc[:,36:].sum().values
for rect, label in zip(rects, text_labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')
plt.show()
We see that Check6 has the most detections, with 80,572 logs, while the second most common is Check4 with 24,815 detections. There is not much difference between Check2, Check3, and Check7, each with around 7,500 - 8,500 detections; similarly for Check5 and Check8, with around 3,000 logs each. Check1 occurs least often, with 1,636 anomalies.
Let's check whether there are any logs with more than one type of anomaly detected:
rowSums = data.iloc[:,36:].sum(axis=1)
multiLabel_counts = rowSums.value_counts()
multiLabel_counts = multiLabel_counts.iloc[1:]
plt.figure(figsize=(15,4))
ax = sns.barplot(multiLabel_counts.index, multiLabel_counts.values)
plt.title("Multi-anomaly detections")
plt.ylabel('Number of detections')
plt.xlabel('Number of anomalies')
#adding the text labels
rects = ax.patches
text_labels = multiLabel_counts.values
for rect, label in zip(rects, text_labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')
plt.show()
There are 78,435 single-anomaly detections and around 22,000 multi-anomaly detections; the latter are combinations of 2 to 7 anomaly types. This confirms that ours is a multi-label problem. Let's check whether there is any correlation between the labels.
#labels correlation
correlation_matrix = data.iloc[:,36:].corr()
fig = plt.figure(figsize=(12,9))
sns.heatmap(correlation_matrix,vmax=0.8,square = True,annot = True)
plt.show()
There is a moderate correlation between Check4 and Check2. All other pairs show only weak positive correlations.
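To quantify how often two correlated checks fire together, we can also look at the conditional probability of one label given the other. A minimal sketch on an illustrative label matrix (made-up values, not the real data):

```python
import pandas as pd

# Illustrative label matrix (NOT the real dataset): two binary checks
lab = pd.DataFrame({
    "Check2": [1, 1, 0, 0, 1, 0, 0, 1],
    "Check4": [1, 1, 0, 1, 1, 0, 0, 0],
})

# P(Check2 = 1 | Check4 = 1): how often Check2 fires when Check4 fires
p_joint = ((lab["Check2"] == 1) & (lab["Check4"] == 1)).mean()
p_check4 = (lab["Check4"] == 1).mean()
p_cond = p_joint / p_check4
print(round(p_cond, 2))  # → 0.75
```

Applied to the real label columns, this kind of conditional rate gives a more direct reading of label co-occurrence than the Pearson coefficient alone.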
Our problem is to detect the right types of anomaly. In this section, we will examine the features from two perspectives: anomalous vs. normal behavior, and how the data differs across the different types of anomaly.
Now we will build some functions to visualize the data statistics:
sns.set()
def logs_per_f(f, graph_type="bar", figsize=(15,8), data=data):
    '''
    The bar graph describes number of anomaly/normal logs per each value of feature f
    '''
    i = np.where(features == f)[0][0]
    # Get the data with f and anomaly info
    tmp_f_data = pd.DataFrame()
    tmp_f_data['Anomaly'] = data.iloc[:,36:44].max(axis=1)
    tmp_f_data[f] = data.iloc[:,i]
    tmp_f_data['tmp'] = data.iloc[:,i]
    # Display the bar graph of anomaly per f
    if graph_type == "bar":
        tmp_grouped_data = tmp_f_data.pivot_table(index=[f],
                                                  columns=['Anomaly'],
                                                  values='tmp',
                                                  fill_value=0,
                                                  aggfunc='count')
        ax = tmp_grouped_data.plot.bar(rot=0, figsize=figsize)
        plt.title("Logs per %s" % f)
        plt.xlabel(f)
    elif graph_type == "dist":
        fig, ax = plt.subplots(figsize=figsize)
        sns.distplot(tmp_f_data.loc[tmp_f_data["Anomaly"] == 0., f], hist=False, rug=True, label="Normal")
        sns.distplot(tmp_f_data.loc[tmp_f_data["Anomaly"] == 1., f], hist=False, rug=True, label="Anomaly")
        plt.title("Logs per %s" % f)
        plt.xlabel(f)
    plt.show()
def anomaly_type_per_f(f, graph_type="bar", figsize=(15,8), data=data):
    '''
    The bar graph expresses the number of different types of anomaly per each value of feature f
    The dist graph expresses the distribution of feature f in different types of anomaly
    '''
    i = np.where(features == f)[0][0]
    # Copy so we don't add a column to the original frame
    tmp_f_data = data.iloc[:,36:44].copy()
    tmp_f_data[f] = data.iloc[:,i]
    if graph_type == "bar":
        tmp_f_data = tmp_f_data.groupby(f).sum()
        ax = tmp_f_data.plot.bar(rot=0, figsize=figsize)
        plt.title("Anomalies types per %s" % f)
        plt.xlabel(f)
    elif graph_type == "dist":
        fig, ax = plt.subplots(figsize=figsize)
        for l in labels:
            sns.distplot(tmp_f_data.loc[tmp_f_data[l] == 1., f], hist=False, rug=True, label=l)
        plt.title("%s Distributions over Anomalies types" % f)
        plt.xlabel(f)
    plt.show()
def anomaly_per_f(f, graph_type="bar", top=0, data=data):
    '''
    The bar graph describes the number of anomalies across the value range of feature f
    '''
    i = np.where(features == f)[0][0]
    tmp_f_data = pd.DataFrame()
    tmp_f_data['Anomaly'] = data.iloc[:,36:44].max(axis=1)
    tmp_f_data[f] = data.iloc[:,i]
    anomaly_per_f = tmp_f_data.groupby(f).sum()
    if graph_type == "dist":
        sns.distplot(anomaly_per_f)
        plt.title("Anomalies per %s" % f)
        # Number of groups which have no anomalies
        print("Number of %s which have no anomalies: %s"
              % (f, anomaly_per_f.loc[anomaly_per_f['Anomaly'] == 0.].count().values))
    elif graph_type == "bar":
        if top > 0:
            anomaly_per_f = anomaly_per_f.nlargest(top, 'Anomaly')
            title = "Top " + str(top) + " Anomalies per " + f
        else:
            title = "Anomalies per " + f
        ax = anomaly_per_f.plot.bar(rot=0, figsize=(15,8))
        plt.title(title)
    plt.show()
def bar_plot(labels, values, title, xlabel, ylabel, size=(15,8)):
    sns.set(font_scale=1)
    plt.figure(figsize=size)
    ax = sns.barplot(labels, values)
    ax.xaxis_date()
    plt.title(title, fontsize=18)
    plt.ylabel(ylabel, fontsize=18)
    plt.xlabel(xlabel, fontsize=18)
    # Adding the text labels
    rects = ax.patches
    for rect, label in zip(rects, values):
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom', fontsize=18)
    plt.show()
After the first part, we have some idea of which features play an important role in anomaly detection, as well as which features have no role at all. Now we do a deeper analysis of the features.
First let's check the distribution of all feature values except Date:
f = pd.melt(data, value_vars=np.delete(features, 2))
g = sns.FacetGrid(f, col="variable", col_wrap=3, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")
Our problem is to detect anomalies, so features with a uniform distribution may not bring much value.
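One quick way to flag such uninformative features is to count distinct values: a (near-)constant column cannot separate anomalies from normal logs. A sketch on a toy frame with hypothetical columns, not the real data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy frame (hypothetical columns): one informative, one constant
df = pd.DataFrame({
    "CPU": rng.integers(0, 100, size=1000),
    "DaemonRestarts": np.zeros(1000),  # constant -> carries no signal
})
# Columns with a single distinct value carry no discriminative information
low_info = [c for c in df.columns if df[c].nunique() <= 1]
print(low_info)  # → ['DaemonRestarts']
```

The same one-liner can be run on the real frame before model selection to justify dropping columns such as the ones removed later in the pre-processing step.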
As we know, the logs were collected over one year. We will group Date by month/day/hour to see the role of this field in anomaly detection.
# Keep the original Date column
tmp_date = data.iloc[:,2]
# Convert into month
data['Date'] = tmp_date.dt.month
logs_per_f('Date', "bar", figsize=(15,4))
# Convert into day
data['Date'] = tmp_date.dt.day
logs_per_f('Date', "bar", figsize=(15,4))
# Convert into hour
data['Date'] = tmp_date.dt.hour
logs_per_f('Date', "bar", figsize=(15,4))
From these graphs we observe that the ratio of anomalies to normal behaviors per month, day, or hour is always around 50%. Some periods have more logs than others, but the anomaly rate does not change much; for example in October and November, at the beginning of the week, or at around 4 a.m.
It seems that Date is not a key feature for anomaly detection. However, at this step we still keep this feature, using only the hour value instead of the full date.
# Convert into hour
data['Date'] = tmp_date.dt.hour
del(tmp_date)
Check the number of anomaly detections for MergeErrors:
for f in ["MergeErrors"]:
    logs_per_f(f, "bar", figsize=(15,4))
From this graph we observe that anomalies are likely to be detected when there are errors in the merge process.
Now we check the distribution of the features in anomaly/normal logs. SystemID is a categorical feature; however, since there are more than 3,000 systems, we include it here so the graphs are easy to generate.
numerical_features = np.array([
"SystemID",
"HighPriorityAlerts",
"Dumps",
"CompositeOOMDums",
"IndexServerRestarts",
"NameServerRestarts",
"XSEngineRestarts",
"StatisticsServerRestarts",
"CPU",
"PhysMEM",
"InstanceMEM",
"TablesAllocation",
"IndexServerAllocationLimit",
"ColumnUnloads",
"DeltaSize",
"BlockingPhaseSec",
"Disk",
"LargestTableSize",
"LargestPartitionSize",
"DiagnosisFiles",
"DiagnosisFilesSize",
"DaysWithSuccessfulDataBackups",
"DaysWithSuccessfulLogBackups",
"DaysWithFailedDataBackups",
"DaysWithFailedfulLogBackups",
"MinDailyNumberOfSuccessfulDataBackups",
"MinDailyNumberOfSuccessfulLogBackups",
"MaxDailyNumberOfFailedDataBackups",
"MaxDailyNumberOfFailedLogBackups",
"LogSegmentChange",
])
for f in numerical_features:
    logs_per_f(f, graph_type="dist", figsize=(15,4))
Checking these graphs gives us the observations below:
This information could be useful when we handle missing data.
for f in ["Date", "MergeErrors", "HighPriorityAlerts"]:
    anomaly_type_per_f(f, "bar", figsize=(15,4))
for f in numerical_features:
    anomaly_type_per_f(f, graph_type="dist", figsize=(15,4))
We see that the ordering of the different anomalies doesn't change much across months.
We now check the correlations among the features, as well as between features and labels:
#correlation
correlation_matrix = data.corr()
fig = plt.figure(figsize=(12,9))
sns.heatmap(correlation_matrix,vmax=0.8,square = True)
plt.show()
We observe that there are some strong correlations between:
tmp_f_data = pd.DataFrame()
tmp_f_data['Anomaly'] = data.iloc[:,36:44].max(axis=1)
cols = ['InstanceMEM', 'TablesAllocation', 'IndexServerAllocationLimit']
for c in cols:
    tmp_f_data[c] = data[c]
sns.pairplot(tmp_f_data.dropna(), hue="Anomaly", vars=cols, size = 4.5)
plt.show();
The previous step should give you a better understanding of which pre-processing is required for the data. This may include:
# List of features which should be in type of integer
integer_features = np.array([
"SessionNumber",
"SystemID",
"HighPriorityAlerts",
"Dumps",
"CleanupOOMDumps",
"CompositeOOMDums",
"IndexServerRestarts",
"NameServerRestarts",
"XSEngineRestarts",
"PreprocessorRestarts",
"DaemonRestarts",
"StatisticsServerRestarts",
"ColumnUnloads",
"DeltaSize",
"MergeErrors",
"BlockingPhaseSec",
"LargestTableSize",
"LargestPartitionSize",
"DiagnosisFiles",
"DiagnosisFilesSize",
"DaysWithSuccessfulDataBackups",
"DaysWithSuccessfulLogBackups",
"DaysWithFailedDataBackups",
"DaysWithFailedfulLogBackups",
"MinDailyNumberOfSuccessfulDataBackups",
"MinDailyNumberOfSuccessfulLogBackups",
"MaxDailyNumberOfFailedDataBackups",
"MaxDailyNumberOfFailedLogBackups",
"LogSegmentChange",
])
# Cast into integer type.
# Note: this must wait until missing values are handled, since columns
# containing NaN cannot be cast to int64.
# for f in integer_features:
#     data[f] = data[f].astype('int64')
Check how many missing values we have in the data
data["Anomaly"] = data.iloc[:,36:44].max(axis=1)
#missing data
total = data.isnull().sum().sort_values(ascending=False)
percent = (data.isnull().sum()/data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(30)
We will delete all records that have no information for any of the labels (NaN for all labels):
data = data.dropna(subset=['Anomaly'])
For MergeErrors, if the record is an anomaly we fill the missing value with 1 (the most frequent value in that case), and with 0 otherwise.
f = "MergeErrors"
data[f] = data[f].fillna(data["Anomaly"])
print(data[f].isnull().sum())
For numerical features, if the record is an anomaly we replace missing values with the median over the anomaly records, and otherwise with the median over the normal records.
nan_features = ['BlockingPhaseSec',
'LogSegmentChange',
'IndexServerAllocationLimit',
'CPU',
'InstanceMEM',
'DiagnosisFiles',
'DiagnosisFilesSize',
'PhysMEM',
'LargestTableSize',
'Disk',
'TablesAllocation',
'DeltaSize',
'LargestPartitionSize',
'Dumps',
'CleanupOOMDumps',
'CompositeOOMDums']
for f in nan_features:
    data.loc[data["Anomaly"] == 1., f] = data.loc[data["Anomaly"] == 1., f].fillna(data.loc[data["Anomaly"] == 1., f].median())
    data.loc[data["Anomaly"] == 0., f] = data.loc[data["Anomaly"] == 0., f].fillna(data.loc[data["Anomaly"] == 0., f].median())
#missing data
total = data.isnull().sum().sort_values(ascending=False)
percent = (data.isnull().sum()/data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(10)
# Fill the remaining missing values with 0
data = data.fillna(0)
Now we cast integer-valued columns to their proper data type:
# List of features which should be in type of integer
integer_features = np.array([
"SessionNumber",
"SystemID",
"HighPriorityAlerts",
"Dumps",
"CleanupOOMDumps",
"CompositeOOMDums",
"IndexServerRestarts",
"NameServerRestarts",
"XSEngineRestarts",
"PreprocessorRestarts",
"DaemonRestarts",
"StatisticsServerRestarts",
"ColumnUnloads",
"DeltaSize",
"MergeErrors",
"BlockingPhaseSec",
"LargestTableSize",
"LargestPartitionSize",
"DiagnosisFiles",
"DiagnosisFilesSize",
"DaysWithSuccessfulDataBackups",
"DaysWithSuccessfulLogBackups",
"DaysWithFailedDataBackups",
"DaysWithFailedfulLogBackups",
"MinDailyNumberOfSuccessfulDataBackups",
"MinDailyNumberOfSuccessfulLogBackups",
"MaxDailyNumberOfFailedDataBackups",
"MaxDailyNumberOfFailedLogBackups",
"LogSegmentChange",
])
# Cast into integer type
for f in integer_features:
    data[f] = data[f].astype('int64')
Remove some features:
data = data.drop(['SessionNumber', 'CleanupOOMDumps', 'PreprocessorRestarts', 'DaemonRestarts', 'Anomaly'], axis=1)
In this section, we select suitable models and run experiments with them.
Our problem is a multi-label classification task with a moderate correlation between 2 of the 8 labels. There are two main approaches to such problems:
Algorithm adaptation methods: adapt a specific algorithm so that it handles the whole multi-label problem directly.
Problem transformation methods: transform the multi-label problem into one or more single-label problems, e.g. one binary problem per label (Binary Relevance), or one multi-class problem where each combination of labels becomes a new target (Label Powerset).
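The Binary Relevance transformation used below boils down to training one independent binary classifier per label column. A minimal sketch of the idea with plain scikit-learn on synthetic data (the toy labels and shapes are illustrative only):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
Y = (X[:, :3] > 0).astype(int)  # 3 toy binary labels

# Binary relevance: one independent classifier per label column
models = [DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, Y[:, j])
          for j in range(Y.shape[1])]
# Stack the per-label predictions back into a label matrix
pred = np.column_stack([m.predict(X) for m in models])
print(pred.shape)  # → (200, 3)
```

This is exactly what `skmultilearn.problem_transform.BinaryRelevance` automates in the cells below; the trade-off is that label correlations (such as Check2/Check4) are ignored by construction.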
For the anomaly detection problem, we choose DecisionTree as the core model because of its advantages, listed below:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.metrics import f1_score
df = data.dropna(subset=labels)
y = df[labels].to_numpy()
df = df.drop(labels, axis=1)
display(df[0:5])
display(y[0:5])
X_train, X_test, y_train, y_test = train_test_split(df, y, train_size=0.8, test_size=0.2)
model = DecisionTreeClassifier(max_depth=8, max_features=15)
model.fit(X_train, y_train)
# tree.plot_tree(model.fit(df1, y1))
export_graphviz(model)
print("score: ", model.score(X_test, y_test))
y_pred = model.predict(X_test)
f1 = f1_score(y_test, y_pred, average='macro')
print("f1 score: ", f1)
y = data[labels]
df = data.drop(labels, axis=1)
X_train, X_test, y_train, y_test = train_test_split(df, y, train_size=0.8, test_size=0.2)
import sklearn.metrics
# Assume the data is already split into X_train/X_test and y_train/y_test.
# Initialize the Binary Relevance multi-label classifier
# with a decision tree base classifier.
classifier = BinaryRelevance(DecisionTreeClassifier(max_depth=8, max_features=15))
# train
classifier.fit(X_train, y_train)
# predict
predictions = classifier.predict(X_test)
display(predictions[:10].todense())
# measure
print(sklearn.metrics.f1_score(y_test, predictions, average='macro'))
data1 = data.fillna(0)
data1[labels] = data1[labels].fillna(0)
y1 = data1[labels].to_numpy()
df1 = data1.drop(labels, axis=1)
display(df1[28:30])
display(y1[20:40])
display(predictions[:100].todense())
from sklearn.metrics import f1_score
from sklearn import tree
tree.export_graphviz(model)
print("score: ", model.score(X_test, y_test))
y_pred = model.predict(X_test)
f1 = f1_score(y_test, y_pred, average='macro')
print("f1 score: ", f1)
display(y_pred[:10])
display(y_test[:10])
def tree_analysis(estimator):
    n_nodes = estimator.tree_.node_count
    children_left = estimator.tree_.children_left
    children_right = estimator.tree_.children_right
    feature = estimator.tree_.feature
    threshold = estimator.tree_.threshold
    # The tree structure can be traversed to compute various properties such
    # as the depth of each node and whether or not it is a leaf.
    node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
    is_leaves = np.zeros(shape=n_nodes, dtype=bool)
    stack = [(0, -1)]  # seed is the root node id and its parent depth
    while len(stack) > 0:
        node_id, parent_depth = stack.pop()
        node_depth[node_id] = parent_depth + 1
        # If we have a test node
        if children_left[node_id] != children_right[node_id]:
            stack.append((children_left[node_id], parent_depth + 1))
            stack.append((children_right[node_id], parent_depth + 1))
        else:
            is_leaves[node_id] = True
    print("The binary tree structure has %s nodes and has "
          "the following tree structure:" % n_nodes)
    for i in range(n_nodes):
        if is_leaves[i]:
            print("%snode=%s leaf node." % (node_depth[i] * "\t", i))
        else:
            print("%snode=%s test node: go to node %s if X[:, %s] <= %s else to "
                  "node %s."
                  % (node_depth[i] * "\t",
                     i,
                     children_left[i],
                     feature[i],
                     threshold[i],
                     children_right[i],
                     ))
    print()
tree_analysis(model)
Rule-based systems are designed by defining specific rules that describe an anomaly. A decision rule is a simple IF-THEN statement consisting of a condition and a prediction. A single decision rule, or a combination of several rules, can be used to make predictions. Such rules are typically based on the experience of industry experts and are ideal for detecting "known anomalies": anomalies that are familiar to us because we recognize what is normal and what is not.
Decision rules follow a general structure: IF the conditions are met THEN make a certain prediction:
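For example, such a rule could be written in a couple of lines. The feature names and thresholds here are made up for illustration, not learned from the data:

```python
# Hypothetical IF-THEN decision rule (made-up thresholds):
def check_rule(cpu, merge_errors):
    # IF CPU > 95 AND MergeErrors >= 1 THEN anomaly (1), ELSE normal (0)
    return 1 if cpu > 95 and merge_errors >= 1 else 0

print(check_rule(cpu=99, merge_errors=2))  # → 1
print(check_rule(cpu=40, merge_errors=0))  # → 0
```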
Quality of a classification rule can be evaluated by:
Advantages:
Disadvantages:
There are many ways to learn rules from data. Some of them are:
In this experiment, we study the RIPPER model, a variant of the sequential covering algorithm. We installed Weka to run the experiment; in Weka, the RIPPER model is called JRip. It is a basic incremental reduced-error pruning algorithm, based on incremental reduced error pruning (IREP). The main idea of the sequential covering algorithm: find a good rule that applies to some of the data points, then remove all data points covered by that rule. The goal is to create rules that cover many examples of one class and none or very few of the other classes. Repeat the rule learning and removal of covered points on the remaining points until no points are left or another stop condition is met. The result is a decision list.

The stop conditions:
RIPPER (Repeated Incremental Pruning to Produce Error Reduction) is a variant of the Sequential Covering algorithm. RIPPER is a bit more sophisticated and uses a post-processing phase (rule pruning) to optimize the decision list (or set). RIPPER can run in ordered or unordered mode and generate either a decision list or decision set.
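The covering loop described above can be sketched in a few lines. This is a toy implementation (single-threshold rules and a simple tp−fp quality score, both our own illustrative choices), not the actual RIPPER/JRip algorithm with its pruning phase:

```python
import numpy as np

def sequential_covering(X, y, feature_names, max_rules=5):
    """Greedy covering sketch: each rule is a single threshold test
    (feature >= value) chosen to cover many positives and few negatives."""
    rules, X, y = [], X.copy(), y.copy()
    for _ in range(max_rules):
        if y.sum() == 0:          # stop: no positives left to cover
            break
        best = None
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                covered = X[:, j] >= t
                tp = (covered & (y == 1)).sum()
                fp = (covered & (y == 0)).sum()
                score = tp - fp   # simple rule quality: positives minus negatives
                if tp > 0 and (best is None or score > best[0]):
                    best = (score, j, t)
        if best is None:
            break
        _, j, t = best
        rules.append(f"IF {feature_names[j]} >= {t} THEN anomaly")
        keep = ~(X[:, j] >= t)    # remove covered points, repeat on the rest
        X, y = X[keep], y[keep]
    return rules

# Tiny made-up sample: 3 anomalies all have CPU >= 90
X = np.array([[90, 0], [95, 1], [20, 0], [30, 1], [97, 0]])
y = np.array([1, 1, 0, 0, 1])
print(sequential_covering(X, y, ["CPU", "MergeErrors"]))
# → ['IF CPU >= 90 THEN anomaly']
```

Real RIPPER additionally grows rules condition by condition, prunes each rule on a held-out split, and runs a global optimization pass over the resulting list.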
data.to_csv('anomaly_detection.csv',index=False)
We use 80% of the data for training and keep the remaining 20% for testing. We train a classifier for each class; these are the results:
Check 1:
Select 5 best examples:
=> Check1=0 (285325.0/3.0)
*The numbers in brackets stand for the positive/negative instances covered by the rule.
We can see that the results are really good, with high Correctly Classified Instances and low Root mean squared error. The time taken to build a model ranges from 39 to 899 seconds. The corresponding decision rules produce exactly the same predictions as the decision tree. Rule sets can be more perspicuous: based on the rule list, we can understand the errors relevant to each class.
We can tune these parameters to get the best model:
Irrespective of your choice, it is highly likely that your model will have one or more parameters that require tuning. There are several techniques for carrying out such a procedure, including cross-validation, Bayesian optimisation, and several others. As before, an analysis into which parameter tuning technique best suits your model is expected before proceeding with the optimisation of your model.
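As one concrete option among the tuning techniques mentioned, a cross-validated grid search over the decision tree parameters can be sketched as follows. Synthetic multi-label data stands in for the real dataset, and the parameter grid is only an example:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the anomaly data (8 binary labels, as in the task)
X, Y = make_multilabel_classification(n_samples=500, n_features=20,
                                      n_classes=8, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2,
                                                    random_state=0)
# 3-fold cross-validated search over an example parameter grid,
# scored with the macro F1 used for evaluation in this challenge
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [4, 8, 12],
                                "min_samples_leaf": [1, 10]},
                    scoring="f1_macro", cv=3)
grid.fit(X_train, Y_train)
print(grid.best_params_)
```

The held-out `X_test`/`Y_test` split should only be scored once, with `grid.best_estimator_`, after the search is finished.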
Some form of pre-evaluation will inevitably be required in the preceding sections in order to both select an appropriate model and configure its parameters appropriately. In this final section, you may evaluate other aspects of the model such as:
For the evaluation of the classification results, you should use F1-score for each class and do the average.
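Concretely, the macro-averaged F1 score is the unweighted mean of the per-class F1 scores, and `sklearn.metrics.f1_score` computes both directly. A small worked example on a 3-label toy matrix:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label ground truth and predictions (4 samples, 3 labels)
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

# Per-class F1, then the unweighted average = macro F1
per_class = f1_score(y_true, y_pred, average=None)
macro = f1_score(y_true, y_pred, average="macro")
# Here: per-class F1 = [1, 2/3, 2/3], so macro = 7/9
print(per_class, round(macro, 3))  # → [1. 0.667 0.667] 0.778
```

Macro averaging weights every class equally, which matters here because the label counts are very imbalanced (1,636 for Check1 vs. 80,572 for Check6).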
N.B. Please note that you are responsible for creating a sensible train/validation/test split. There is no predefined held-out test data.
As you will see in the dataset description, the labels you are going to predict have no meaningful names. Try to understand which kind of anomalies these labels refer to and give sensible names. To do it, you could exploit the output of the interpretable models and/or use a statistical approach with the data you have.